Abstract

This report presents a theoretical study of the transmission of
information in the case of discrete messages and noiseless systems. The study
begins with the definition of a unit of information (a selection between
two choices equally likely to be selected), and this is then used to
determine the amount of information conveyed by the selection of one of an
arbitrary number of choices equally likely to be selected. Next, the average
amount of information per selection is computed in the case of messages
consisting of sequences of independent selections from an arbitrary number of
choices with arbitrary probabilities of their being selected. A recoding
procedure is also presented for improving the efficiency of transmission by
reducing, on the average, the number of selections (digits or pulses)
required to transmit a message of given length and given statistical character.
The results obtained in the case of sequences of independent selections are
extended later to the general case of non-independent selections. Finally,
the optimum condition is determined for the transmission of information by
means of quantized pulses when the average power is fixed.



THE TRANSMISSION OF INFORMATION


Introduction 

It is the opinion of many workers in the field of electrical
communications that the communication art is today at a major turning point of its
development. The objective of almost all electrical communication systems
has been, up to now, to eliminate distance in some form of human activity or
relationships between men. Telegraph, telephone and television are typical
examples of such communication systems. We may add to these teletype,
telecontrol and telemetering. It is interesting to note that the names of all
these communication systems involve the prefix tele, meaning "at a distance".

Although, for obvious reasons, forms of communication over distances
much greater than the ranges of human senses and reach were first to receive
attention, the magnitude of the distance involved is not of primary
importance from a logical point of view in the concept of communication.
Communication is basically any form of transmission of information, regardless
of the distance between the transmitter and the receiver. In a broader
sense, the field of communication includes any handling, combining, comparing
or employing of information, since such processes involve and are intimately
connected with the transmission of such information.

It is clear, then, that most human activities involve communication in
a broad sense, and, in particular, those activities which are considered of
higher intellectual type because they depend to a high degree on the process
of "thinking". Thinking itself, in fact, involves a natural communication
system of a complexity far beyond that conceivable for any man-made system.

The above considerations point clearly to a very wide field of useful 
applications of the communication art which has hardly been touched as yet. 

It is to be expected that each application should present problems of a
higher order of complexity than those encountered in the past. Consequently,
it is also to be expected that the solution of these problems should
necessitate the use of more powerful analytical tools and, particularly, should
require a more fundamental study of the process of transmission of
information. As a matter of fact, the first and most significant step in the
direction of such a study was made by Norbert Wiener (1) in connection with
the development of predictors for antiaircraft fire control. The statistical
nature of this problem led him to the realization that all communication
problems are fundamentally of a statistical nature, and must be handled
accordingly. He argued that the signal to be transmitted in a communication
system can never be considered as a known function of time, because if it
were a priori known it could not convey any new information and therefore
would not need to be transmitted. On the other hand, what can be known
a priori about a signal to be transmitted is its statistical character,
that is, for instance, the probability distribution of its amplitude. In
addition, it is equally clear that noise, which plays such an important
part in communication problems, can be described only in statistical terms.

It follows that all communication problems are inherently statistical in
nature, and that disregarding this fact may lead to unexplainable
inconsistencies in addition to precluding a deeper understanding of such problems.

The statistical theory of optimum prediction and filtering developed
by Wiener led further to the realization of the need for a basic and general
criterion for judging the quality of communication systems. In fact, the
mean-square error criterion used by Wiener in this part of his work is
dictated by mathematical convenience rather than by physical considerations;
consequently it may not be useful in certain practical problems. The search
for a more appropriate criterion leads naturally to the question of what is
the operation that a communication system must perform. If we take as an
example a telegraph system, it might seem at first obvious that such a system
must reproduce at the output each and every letter of the input message in
the proper order. We may observe, however, that if one letter is received
incorrectly, the word containing it is still perfectly understandable in
most cases, and so, of course, is the whole message. Moreover, the message
would still be comprehensible if, for instance, all the vowels were
eliminated (which is what is done in written Hebrew). On the other hand, the
incorrect transmission of a digit in a number would make the received
message incorrect.

It appears therefore that the transmission of the information conveyed
by a written message is what we wish to obtain and that this is not
necessarily equivalent to the transmission of all the letters contained in the
written message. More precisely, it appears that the different symbols,
letters or figures contained in a written message do not contribute equally
to the transmission of information, so much so that some of them may be
completely unnecessary. Similar conclusions are reached by considering
other types of communication systems. In particular, the recent work on
the Vocoder (2) and the clipping of speech waves (3) has provided
considerable evidence in the same general direction.

The above considerations are relevant to another problem with which
communication engineers are becoming more and more concerned, namely, that
of bandwidth reduction. As a matter of fact, the Vocoder was developed
primarily for the purpose of reducing the bandwidth required for speech
transmission. It is clear that if different parts of a message are not
equally important, some saving in bandwidth might be possible by providing
transmission facilities which are proportional to the importance of these
different parts. The bandwidth problem, in turn, is intimately connected
with the noise-reduction problem. In fact, all the different types of
modulation developed for the purpose of noise and interference reduction
require a bandwidth wider than that required by amplitude modulation. This
method of paying for an improved signal-to-noise ratio with an increased
bandwidth appears to be the result of some fundamental limitation which,
however, the conventional approach to communication problems has failed to
clarify.

The above discussion of some of the problems confronting or likely to
confront the communication engineer indicates clearly the necessity of
providing a measure for the "thing" which is to be transmitted and which has
been vaguely called "information". Such a measure will then permit a
quantitative and more fundamental study of the process involved in the
transmission of information which, in turn, will lead eventually to the design
of better and more efficient communication devices. A considerable amount
of work in this direction has already been done independently by Norbert
Wiener (4) and Claude Shannon (5). The work of Wiener is particularly
outstanding because of its philosophical profoundness and its importance in
many branches of science other than communication engineering. Mention
should be made also of the pioneering work of Hartley (6) and of the more
recent work of Tuller (7).

This paper presents the work done by the author in the past year on
the transmission of discrete signals through a noiseless channel. Although
most of the results obtained have already been published by Wiener and
Shannon, it is felt that the method of approach used here is sufficiently
different to justify this redundant presentation.

I. Definition of the Unit of Information 

In order to define, in an appropriate and useful manner, a unit of
information, we must first consider in some detail the nature of those
processes in our experience which are generally recognized as conveying
information. A very simple example of such processes is a yes-or-no answer
to some specific question. A slightly more involved process is the
indication of one object in a group of N objects, and, in general, the selection
of one choice from a group of N specific choices. The word "specific" is
underlined because such a qualification appears to be essential to these
information-conveying processes. It means that the receiver is conscious
of all possible choices, as is, of course, the transmitter (that is, the
individual or the machine which is supplying the information). For instance,
saying "yes" or "no" to a person who has not asked a question obviously does
not convey any information. Similarly, the reception of a code number which
is supposed to represent a particular message does not convey any
information unless there is available a code book containing all the messages with
the corresponding code numbers.

Considering next more complex processes, such as writing or speaking,
we observe that these processes consist of orderly sequences of selections
from a number of specific choices, namely, the letters of the alphabet or
the corresponding sounds. Furthermore, there are indications that the
signals transmitted by the nervous system are of a discrete rather than of a
continuous nature, and might also be considered as sequences of selections.
If this were the case, all information received through the senses could be
analyzed in terms of selections. The above discussion indicates that the
operation of selection forms the basis of a number of processes recognized
as conveying information, and that it is likely to be of fundamental
importance in all such processes. We may expect, therefore, that a unit of
information, defined in terms of a selection, will provide a useful basis
for a quantitative study of communication systems.

Considering more closely this operation of selection, we observe that
different informational value is naturally attached to the selection of the
same choice, depending on how likely the receiver considered the selection
of that particular choice to be. For example, we would say that little
information is given by the selection of a choice which the receiver was
almost sure would be selected. It seems appropriate, therefore, in order to
avoid difficulty at this early stage, to use in our definition the particular
case of equally likely choices, that is, the case in which the receiver has
no reason to expect that one choice will be selected rather than any other.
In addition, our natural concept of information indicates that the
information conveyed by a selection increases with the number of choices from which
the selection is made, although the exact functional relation between these
two quantities is not immediately clear.

On the basis of the above considerations, it seems reasonable to define
as the unit of information the simplest possible selection, namely, the
selection between two equally likely choices, called, hereafter, the
"elementary selection". For completeness, we must add to this definition the
postulate, consistent with our intuition, that N independent selections of
this type constitute N units of information. By independent selections we
mean, of course, selections which do not affect one another. We shall adopt
for this unit the convenient name of "bit" (from "binary digit"), suggested
by Shannon. We shall also refer to a selection between two choices (not
necessarily equally likely) as a "binary selection", and to a selection from
N choices, as an N-order selection. When the choices are, a priori, equally
likely, we shall refer to the selection as an "equally likely selection".

We can now proceed to develop ways of measuring the information content of
discrete messages in terms of the unit just defined. Most of this paper
will be devoted to the solution of this problem.


II. Selection from N Equally Likely Choices

Consider now the selection of one among a number, N, of equally likely
choices. In order to determine the amount of information corresponding to
such a selection, we must reduce this more complex operation to a series of
independent elementary selections. The required number of these elementary
selections will be, by definition, the measure in bits of the information
given by such an N-order selection.

Let us assume for the moment that N is a power of two. In addition
(just to make the operation of selection more physical), let us think of
the N choices as N objects arranged in a row, as indicated in Figure 1.


[Figure 1: objects arranged in a row, each labeled with its binary number,
with the successive divisions marked.]

Fig. 1 Selection procedure for equally likely choices.


These N objects are first divided in two equal groups, so that the object
to be selected is just as likely to be in one group as in the other. Then
the indication of the group containing the desired object is equivalent to
one elementary selection, and, therefore, to one bit. The next step
consists of dividing each group into two equal subgroups, so that the object
to be selected is again just as likely to be in either subgroup. Then one
additional elementary selection, that is a total of two elementary
selections, will suffice to indicate the desired subgroup (of the possible four
subgroups). This process of successive subdivisions and corresponding
elementary selections is carried out until the desired object is isolated from
the others. Two subdivisions are required for N = 4, three for N = 8, and,
in general, a number of subdivisions equal to log_2 N, in the case of an
N-order selection.

The same process can be carried out in a purely mathematical form by
assigning order numbers from 0 to N-1 to the N choices. The numbers are
then expressed in the binary system, as shown in Figure 1, the number of
binary digits (0 or 1) required being equal to log_2 N. These digits represent
an equal number of elementary selections and, moreover, correspond in order
to the successive divisions mentioned above. In conclusion, an N-order,
equally likely selection conveys an amount of information

$$ H_N = \log_2 N . \tag{1} $$

The above result is strictly correct only if N is a power of two, in
which case H_N is an integer. If N is not a power of two, then the number of
elementary selections required to specify the desired choice will be equal
to the logarithm of either the next lower or the next higher power of two,
depending on the particular choice selected. Consider, for instance, the
case of N = 3. The three choices, expressed as binary numbers, are then

00 ; 01 ; 10

If the binary digits are read in order from left to right, it is clear
that the first two numbers require two binary selections, that is, two
digits, while the third number requires only the first digit, 1, in order to
be distinguished from the other two. In other words, the number of
elementary selections required when N is not a power of two is equal to either one
of the two integers closest to log_2 N. It follows that the corresponding
amount of information must lie between these two limits, although the
significance of a non-integral value of H_N is not clear at this point. It will
be shown in the next section that Eq.(1) is still correct when N is not a
power of two, provided H_N is considered as an average value over a large
number of selections.
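
The numbering argument of this section can be illustrated with a short
Python sketch. It is not part of the original report, and the function name
binary_labels is a hypothetical helper introduced only for this example. It
writes the order numbers 0 to N-1 in the binary system and, for N = 3,
compares the one or two digits needed with log_2 3.

```python
from math import log2

def binary_labels(n_choices):
    """Order numbers 0..N-1 written with a fixed number of binary digits."""
    width = max(1, (n_choices - 1).bit_length())
    return [format(k, f"0{width}b") for k in range(n_choices)]

# N = 8 is a power of two: every choice needs exactly log2(8) digits.
labels_8 = binary_labels(8)
digits_8 = len(labels_8[0])

# N = 3: the third number, 10, is already identified by its first digit.
labels_3 = ["00", "01", "1"]
lengths_3 = [len(label) for label in labels_3]

print(labels_8, digits_8)
print(lengths_3, log2(3))
```

For N = 3 the digit counts are the two integers on either side of log_2 3,
which is the situation described in the text.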

III. Messages and Average Amount of Information 

We have determined in the preceding section the amount of information
conveyed by a single selection from N equally likely choices. In general,
however, we have to deal with not one but long series of such selections,
which we call messages. This is the case, for instance, in the transmission
of written intelligence. Another example is provided by the communication
system known as pulse-code modulation, in which audio waves are sampled at
equal time intervals and then each sample is quantized, that is approximated
by the closest of a number N of amplitude levels.

Let us consider, then, a message consisting of a sequence of n
successive N-order selections. We shall assume, at first, that these selections
are independent and equally likely. In this simpler case, all the different
sequences which can be formed, equal in number to

$$ S = N^n , \tag{2} $$

are equally likely to occur. For instance, in the case of N = 2 (the two
choices being represented by the numbers 0 and 1) and n = 3, the possible
sequences would be 000, 001, 010, 100, 011, 101, 110, 111. The total number
of these sequences is S = 8 and the probability of each sequence is 1/8.
In general, therefore, the ensemble of the possible sequences may be
considered as forming a set of S equally likely choices, with the result that
the selection of any particular sequence yields an amount of information

$$ H_S = \log_2 S = n \log_2 N . \tag{3} $$


In words, n independent equally likely selections give n times as much
information as a single selection of the same type. This result is certainly
not surprising, since it is just a generalization of the postulate, stated
in Section I, which forms an integral part of the definition of information.
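
As a minimal sketch of Eqs.(2) and (3), the following Python lines enumerate
the sequences for the example N = 2, n = 3 used above. The helper name
sequences is an assumption made for this illustration and does not appear in
the report.

```python
from itertools import product
from math import log2

def sequences(n_choices, n_selections):
    """All sequences of n independent selections from the choices 0..N-1."""
    return ["".join(map(str, s))
            for s in product(range(n_choices), repeat=n_selections)]

seqs = sequences(2, 3)
S = len(seqs)            # Eq.(2): S = N**n
info_bits = log2(S)      # Eq.(3): log2(S) = n * log2(N)

print(seqs)
print(S, info_bits)
```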

It is often more convenient, in dealing with long messages, to use a
quantity representing the average amount of information per N-order selection,
rather than the total information corresponding to the whole message. We
define this quantity in the most general case as the total information
conveyed by a very long message divided by the number of selections in the
message, and we shall indicate it with the symbol H_N, where N is the order
of each selection. It is clear that when all the selections in the message
are equally likely and independent and, in addition, N is a power of two,
the quantity H_N is just equal to the information actually given by each
selection, that is

$$ H_N = \frac{1}{n} \log_2 S = \log_2 N . \tag{4} $$


We shall show now that this equation is correct also when N is not a power
of two, in which case H_N has to be actually an average value taken over a
sufficiently long sequence of selections.*

* The author is indebted to Mr. T. P. Cheatham, Jr. (of this Laboratory) for the
original idea on which is based both this proof and the corresponding recoding
procedure (see Section IV).

The number S of different and equally likely sequences which can be
formed with n independent and equally likely selections is still given by
Eq.(2), even when N is not a power of two. On the contrary, the number of
elementary selections required to specify any one particular sequence must
be written now in the form

$$ B_S = \log_2 S + d , \tag{5} $$

where d is a number, smaller in magnitude than unity, which makes B_S an
integer and which depends on the particular sequence selected. The average
amount of information per N-order selection is then, by definition,

$$ H_N = \lim_{n\to\infty} \frac{1}{n} (\log_2 S + d) . \tag{6} $$

Since N is a constant and since the magnitude of d is smaller than unity
while n approaches infinity, this equation together with Eq.(2) yields

$$ H_N = \log_2 N . \tag{7} $$
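
The limit in Eq.(6) can be checked numerically. The sketch below is an
illustration under a stated assumption, not the author's procedure: every
sequence is given a fixed-length binary number, so that B_S is the smallest
integer with 2^B_S >= S (that is, 0 <= d < 1). The function name
digits_per_selection is hypothetical.

```python
from math import log2

def digits_per_selection(n_choices, n_selections):
    """Binary digits per N-order selection when all S = N**n sequences
    are numbered with the same number of binary digits."""
    S = n_choices ** n_selections
    B = (S - 1).bit_length()    # smallest integer B with 2**B >= S
    return B / n_selections

for n in (1, 10, 100, 1000):
    print(n, digits_per_selection(3, n), log2(3))
```

For N = 3 the ratio starts at 2 digits for n = 1 and approaches log_2 3 as n
grows, while for a power of two it equals log_2 N for every n.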

We shall consider now the more complex case in which the selections, 
although still independent, are not equally likely. In this case, too, we 
wish to compute the average amount of information per selection. For this 
purpose, we consider again the ensemble of all the messages consisting of 
n independent selections and we look for a way of indicating any one
particular message by means of elementary selections. If we were to proceed as
before, and divide the ensemble of messages in two equal groups, the
selection of the group containing the desired message would no longer be a
selection between equally likely choices, since the sequences themselves 
are not equally likely. The proper procedure is now, of course, to make 
equal for each group not the number of messages in it but the probability 
of its containing the desired message. Then the selection of the desired 
group will be a selection between equally likely choices. This procedure 
of division and selection is repeated over and over again until the desired 
message has been separated from the others. The successive selections of 
groups and subgroups will then form a sequence of independent elementary
selections. 

One may observe, however, that it will not generally be possible to 
form groups equally likely to contain the desired message, because shifting 
any one of the messages from one group to the other will change, by finite 
amounts, the probabilities corresponding to the two groups. On the other 
hand, if the length of the messages is increased indefinitely, the accuracy 
with which the probabilities of the two groups can be made equal becomes 
better and better since the probability of each individual message approaches 
zero. Even so, when the resulting subgroups include only a few messages 
after a large number of divisions, it may become impossible to keep the
probabilities of such subgroups as closely equal as desired unless we
proceed from the beginning in an appropriate manner as indicated below. The
messages are first arranged in order of their probabilities, which can be
easily computed if the probabilities of the choices are known. The divisions
in groups and subgroups are then made successively without changing the order
of the messages, as illustrated in Figure 2. In this manner, the smaller
subgroups will contain messages with equal or almost equal probabilities, so 
that further subdivisions can be performed satisfactorily. 
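
The division procedure just described can be sketched in Python as follows.
This is a minimal illustration, not the author's own formulation: the
function name fano_code, the rule of cutting where the two group
probabilities differ least, and the use of exact fractions are assumptions
made for this example. The data are the choice probabilities 0.7, 0.2, 0.1
and the messages of two selections used in Figure 2.

```python
from fractions import Fraction
from itertools import product

def fano_code(items):
    """items: list of (message, probability) in decreasing order of probability.
    Split into two groups of nearly equal probability, without reordering,
    and repeat on each group. Returns {message: binary code}."""
    if len(items) == 1:
        return {items[0][0]: ""}
    total = sum(p for _, p in items)
    best_cut, best_diff, running = 1, None, Fraction(0)
    for cut in range(1, len(items)):
        running += items[cut - 1][1]
        diff = abs(total - 2 * running)   # |P(first group) - P(second group)|
        if best_diff is None or diff < best_diff:
            best_cut, best_diff = cut, diff
    codes = {}
    for digit, group in (("0", items[:best_cut]), ("1", items[best_cut:])):
        for msg, code in fano_code(group).items():
            codes[msg] = digit + code
    return codes

p = {0: Fraction(7, 10), 1: Fraction(2, 10), 2: Fraction(1, 10)}
messages = [("".join(map(str, m)), p[m[0]] * p[m[1]])
            for m in product(range(3), repeat=2)]
messages.sort(key=lambda item: -item[1])   # stable sort keeps ties in order

codes = fano_code(messages)
average_length = sum(prob * len(codes[msg]) for msg, prob in messages)

for msg, prob in messages:
    print(msg, float(prob), codes[msg])
print(float(average_length))
```

Exact fractions are used so that ties between equally probable groups are
not disturbed by rounding.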


   Probabilities of groups obtained
       by successive divisions
   I     II    III    IV     V     VI                      Recoded
  Div.  Div.  Div.   Div.   Div.  Div.   Message   P(i)    message   P(i) B_S(i)

  0.49                                     00      0.49    0            0.49
  0.51  0.28  0.14                         01      0.14    100          0.42
  0.51  0.28  0.14                         10      0.14    101          0.42
  0.51  0.23  0.14   0.07                  02      0.07    1100         0.28
  0.51  0.23  0.14   0.07                  20      0.07    1101         0.28
  0.51  0.23  0.09   0.04                  11      0.04    1110         0.16
  0.51  0.23  0.09   0.05   0.02           12      0.02    11110        0.10
  0.51  0.23  0.09   0.05   0.03  0.02     21      0.02    111110       0.12
  0.51  0.23  0.09   0.05   0.03  0.01     22      0.01    111111       0.06

                                                   (B_S)av. = 2.33

(Each row lists the probability of the group containing the message at
each division.)

Fig. 2 Recoding of messages consisting of 2 third-order
selections, for choice probabilities p(0) = 0.7, p(1) = 0.2,
p(2) = 0.1. H_3 = -[0.7 log_2 0.7 + 0.2 log_2 0.2 + 0.1 log_2 0.1]
= 1.157.

For original code:   η = H_3 / log_2 3 = 0.73
For new code:        η = 2 H_3 / (B_S)av. = 0.993

It is clear that when the above procedure is followed, the number of
binary selections required to separate any message from the others varies
from message to message. Messages with a high probability of being selected
require fewer binary selections than those with lower probabilities. This
fact is in agreement with the intuitive notion that the selection of a
little-probable message conveys more information than the selection of a
more-probable one. Certainly, the occurrence of an event which we know
a priori to have a 99 per cent probability is hardly surprising or, in our
terminology, yields very little information, while the occurrence of an
event which has a probability of only 1 per cent yields considerably more
information. More precisely, as shown below, if P(i) is the probability
of the i-th message, the number of binary selections required to indicate
this message will be an integer B_S(i) close to -log_2 P(i). In fact, P(i)
is just the probability of the last subgroup obtained by successively
halving (approximately) the probability of the whole ensemble of messages
(which is unity) a number of times equal to B_S(i), so that

$$ P(i) \approx 2^{-B_S(i)} . $$

By making the messages sufficiently long, that is, the number n of N-order
selections sufficiently large, the integer B_S(i) can be made to differ in
percentage from -log_2 P(i) by less than any desired amount. Hence, in this
limiting case, we can write

$$ B_S(i) = -\log_2 P(i) . \tag{8} $$
Let us consider now a sequence of M selections of messages, each message
consisting of n N-order selections (forming a sequence of nM selections).
By making the number M sufficiently large, we can be practically sure that
the i-th message will appear in the sequence with a frequency as close to
P(i) as desired. Therefore the number of binary selections required on the
average to select one message, that is, "the mathematical expectation of
B_S", will be

$$ E(B_S) = \sum_{i=0}^{S-1} P(i)\, B_S(i) . \tag{9} $$


The average amount of information per N-order selection is then, from
Eqs. (8) and (9),

$$ H_N = \lim_{n\to\infty} \frac{E(B_S)}{n} = \lim_{n\to\infty} -\frac{1}{n} \sum_{i=0}^{S-1} P(i) \log_2 P(i) , \tag{10} $$

that is, the limit of the ratio of the number of binary selections required,
on the average, to select one message to the number of N-order selections
in the message.
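
To make Eqs.(8) and (9) concrete, the following Python sketch evaluates them
for the recoded messages of Figure 2 (n = 2, choice probabilities 0.7, 0.2,
0.1). The recoded messages are copied from the figure; the variable names
are assumptions made for this illustration.

```python
from math import log2

p = {"0": 0.7, "1": 0.2, "2": 0.1}
recoded = {"00": "0", "01": "100", "10": "101", "02": "1100", "20": "1101",
           "11": "1110", "12": "11110", "21": "111110", "22": "111111"}
n = 2  # selections per message

# Probability of each message (independent selections).
P = {msg: p[msg[0]] * p[msg[1]] for msg in recoded}

# Eq.(9): expected number of binary selections per message.
expected_B = sum(P[msg] * len(code) for msg, code in recoded.items())

# Average information per selection and efficiency of the new code.
H = -sum(q * log2(q) for q in p.values())
efficiency = n * H / expected_B

for msg, code in recoded.items():
    # B_S(i) is an integer close to -log2 P(i), as stated before Eq.(8).
    print(msg, len(code), round(-log2(P[msg]), 2))
print(round(expected_B, 2), round(H, 3), round(efficiency, 3))
```

With messages of only two selections, E(B_S)/n is already within about one
per cent of the average information per selection.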

Now let p(k) be the probability of the k-th choice (of the N), and n_k(i)
be the number of times the k-th choice is selected in the i-th message
(sequence of n selections). The probability of the i-th message is

$$ P(i) = \prod_{k=0}^{N-1} [p(k)]^{n_k(i)} . \tag{11} $$


The number of binary selections required to indicate this message can be
written as

$$ B_S(i) = -\log_2 \prod_{k=0}^{N-1} [p(k)]^{n_k(i)} = -\sum_{k=0}^{N-1} n_k(i) \log_2 p(k) \tag{12} $$


with any degree of accuracy desired. In the limit when n approaches infinity
these binary selections become elementary selections, that is, binary
selections between equally likely choices. We must now compute E(B_S) according
to Eq.(9). The number of sequences of selections, that is, messages, to
which correspond the same values of P(i) and B_S(i), is equal to the number
of different permutations of the choices selected in the i-th sequence; that
is, to

$$ \frac{n!}{\prod_{k=0}^{N-1} n_k!} . $$


It follows that the average value of B_S(i) is given by

$$ E(B_S) = -\sum \left[ \frac{n!}{\prod_{k=0}^{N-1} n_k!} \prod_{k=0}^{N-1} [p(k)]^{n_k} \right] \sum_{k=0}^{N-1} n_k \log_2 p(k) , \tag{13} $$

where the n_k and p(k) are always positive and subject to the conditions

$$ \sum_{k=0}^{N-1} n_k = n , \tag{14} $$

$$ \sum_{k=0}^{N-1} p(k) = 1 . \tag{15} $$

The overall summation in Eq.(13) is made over all possible combinations of
integral positive values of the n_k which satisfy Eq.(14).

In order to compute the values of E(B_S) we begin by expressing the
factorials in Eq.(13) by means of Stirling's formula (8)(9),

$$ n! \approx \sqrt{2\pi n}\; n^n e^{-n} , \tag{16} $$

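valid for large values of n. We obtain then

$$ \frac{n!}{\prod_{k=0}^{N-1} n_k!} \prod_{k=0}^{N-1} [p(k)]^{n_k} \approx \frac{\sqrt{2\pi n}\; n^n e^{-n}}{\prod_{k=0}^{N-1} \sqrt{2\pi n_k}\; n_k^{n_k} e^{-n_k}} \prod_{k=0}^{N-1} [p(k)]^{n_k} = \frac{f(x)}{n^{N-1}} , \tag{17} $$

where

$$ f(x) = \left( \frac{n}{2\pi} \right)^{(N-1)/2} \frac{\prod_{k=0}^{N-1} \left[ p(k)/x_k \right]^{n x_k}}{\prod_{k=0}^{N-1} (x_k)^{1/2}} . \tag{18} $$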


The variables x_k = n_k/n are always positive, smaller than unity and subject
to the constraint

$$ \sum_{k=0}^{N-1} x_k = 1 . \tag{19} $$

k-0 


It is convenient, at this point, to consider the function f(x) as a
continuous, rather than a discontinuous, function of the x_k and to transform
the summation of Eq.(13) into an integral. We observe, in this regard, that
when n_k varies from zero to n, x_k varies from zero to one. It follows that
to a unit increment of n_k (n_k takes only integral values) corresponds an
increment of x_k equal to 1/n. Therefore, when n approaches infinity, to the
unit increments of the n_k correspond the differentials dx_k = 1/n. In
conclusion, the summation of Eq.(13) can be transformed (10) into an integral
and Eq.(10) then becomes

$$ H_N = -\lim_{n\to\infty} \int \cdots \int f(x) \sum_{k=0}^{N-1} x_k \log_2 p(k) \; dx_1 \cdots dx_{N-1} . \tag{20} $$


The integration is extended over the region of the hyperplane defined by
Eq.(19), in which all the x_k are positive and smaller than one. It will be
noted that in Eq.(20) x_0 is considered as a function of all the other x_k,

$$ x_0 = 1 - \sum_{k=1}^{N-1} x_k , \tag{21} $$

so as to limit the integration to the above-mentioned hyperplane.

To compute the integral appearing in Eq.(20), we observe first that the
integral of f(x) alone over the same region represents the summation of the
probabilities of all possible messages consisting of n selections, provided,
of course, that n is sufficiently large. Therefore, the integral of f(x)
must be equal to unity for all large values of n. On the other hand, as
shown in Appendix I, f(x) has a peak at a point which approaches x_k = p(k)
when n approaches infinity. The height of this peak is proportional to
n^{(N-1)/2}. It follows that when n approaches infinity, f(x) becomes a
delta-function, or unit impulse, located at x_k = p(k). The integral of Eq.(20)
is, therefore, equal to the value for x_k = p(k) of the rest of the integrand,
that is, of the summation. Eq.(20) yields finally

$$ H_N = -\sum_{k=0}^{N-1} p(k) \log_2 p(k) , \tag{22} $$

which is then the average amount of information per N-order selection.
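
Eq.(22) can be compared numerically with the sum appearing in Eq.(10). The
Python sketch below, with hypothetical helper names, enumerates all messages
of n independent selections for small n and evaluates -(1/n) Σ P(i) log_2
P(i). For independent selections this quantity coincides with Eq.(22) for
every n, since the logarithm of a product of probabilities is the sum of
their logarithms; the limit in Eq.(10) is needed only because B_S(i) must be
an integer.

```python
from itertools import product
from math import isclose, log2, prod

def entropy(p):
    """Eq.(22): average information per selection, in bits."""
    return -sum(q * log2(q) for q in p if q > 0)

def per_selection_average(p, n):
    """-(1/n) * sum of P(i) log2 P(i) over all messages of n selections."""
    total = 0.0
    for msg in product(range(len(p)), repeat=n):
        P = prod(p[k] for k in msg)   # Eq.(11) for this message
        total -= P * log2(P)
    return total / n

p = [0.7, 0.2, 0.1]
for n in range(1, 5):
    print(n, per_selection_average(p, n), entropy(p))
```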

The conclusions which can be reached from the evaluation of the integral
in Eq.(20) extend far beyond Eq.(22). It is easy to see that if the function

$$ \sum_{k=0}^{N-1} x_k \log_2 p(k) $$

were any other finite function of the x_k, the limiting value of the integral
would still be equal to the value of the function for x_k = p(k). In other
words, the expectation (or average value) of any function of the x_k is equal
to the value of the function itself for x_k = p(k). From a physical point of
view, we can say that the ensemble of possible sequences of selections can
be divided in two groups. The first group consists of sequences for which
the frequencies x_k of occurrence of the different choices differ from the
probabilities p(k) of the choices by less than amounts which approach zero
as 1/√n when n approaches infinity. The total probability of the sequences
in this group approaches unity when n increases indefinitely, and therefore
the number of sequences in this group approaches

$$ M = \prod_{k=0}^{N-1} [p(k)]^{-n p(k)} . \tag{23} $$



The second group consists of all other sequences, and Its total probability 
approaches zero when n approaches Infinity, 

The sequences of the first group are all equally probable and, therefore,
the selection of one of them out of the group requires a number of
binary selections equal to

$$ \log_2 M = n H_N . \tag{24} $$

In other words, the sequences of the first group can be represented by means
of sequences of $nH_N$ binary digits, that is, $H_N$ digits per N-order selection.
All the other sequences together, regardless of the way in which they are
represented, cannot increase by any finite amount, beyond $H_N$, the number of
binary digits required on the average per N-order selection.
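
As a rough numerical illustration of Eqs.(23) and (24), the sketch below compares $\log_2 M$ with the logarithm of the number of binary sequences whose choice frequencies are exactly equal to the probabilities. The function names and the values N = 2, p(0) = 0.7, p(1) = 0.3, n = 1000 are assumptions chosen only for this illustration.

```python
from math import comb, log2

def log2_M(p, n):
    """log2 of M in Eq.(23): M = prod_k p(k) ** (-n * p(k))."""
    return sum(-n * pk * log2(pk) for pk in p if pk > 0)

def log2_exact_frequency_count(n, ones):
    """log2 of the number of binary sequences of length n with `ones` ones."""
    return log2(comb(n, ones))

n = 1000
p = [0.7, 0.3]                          # assumed choice probabilities
per_selection_M = log2_M(p, n) / n      # this is H_N of Eq.(22)
per_selection_count = log2_exact_frequency_count(n, 300) / n
print(per_selection_M, per_selection_count)
```

The two values per selection differ by an amount that shrinks as n grows, which is the sense in which the first group contains about M sequences.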

The expression for $H_N$ obtained above indicates that $H_N$ can be considered
as the expectation of $\log_2 [1/p(k)]$. In other words, we may say that the
selection of a particular choice k conveys an amount of information equal to
the logarithm-base-two of the reciprocal of its probability. This interpretation
is fundamental. It will be shown later to apply also to the
general case of non-independent selections, in which case p(k) will be
substituted by the conditional probability that the k-th choice will be selected,
based on the knowledge of all preceding selections.

It is easy to see from Eq.(22) that $H_N$ vanishes only when all but one
of the p(k) are equal to zero, in which case the one different from zero
must be equal to unity. In other words, $H_N$ vanishes only when the choice
which will be selected is known a priori with unity probability. In this
instance, it is intuitively clear that no information is being transmitted.

On the other hand, $H_N$ is a maximum (as shown in Appendix I), when all the
p(k) are equal, that is, when there is no a priori knowledge at all about
the selections. Under these circumstances, Eq.(22) reduces to Eq.(7), since
p(k) = 1/N. The manner in which $H_N$ varies with the probabilities of the
choices is illustrated in Figure 3, for the particular case of N = 2.
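
The two limiting cases just described can be checked numerically. The following is a minimal sketch of Eq.(22), written as the expectation of $\log_2 [1/p(k)]$; the function names and the sample probabilities are assumptions introduced for illustration.

```python
from math import log2

def information_of_choice(pk):
    """Information conveyed by selecting a choice of probability pk."""
    return log2(1 / pk)

def information_per_selection(p):
    """H_N of Eq.(22); choices with p(k) = 0 contribute nothing."""
    return sum(pk * information_of_choice(pk) for pk in p if pk > 0)

known_a_priori = information_per_selection([1.0, 0.0, 0.0])  # no information
equally_likely = information_per_selection([0.25] * 4)       # log2(4)
binary_07 = information_per_selection([0.7, 0.3])            # N = 2 example
```

With four equally likely choices the result equals $\log_2 4$, and any departure from equal probabilities lowers it.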

The amount of information conveyed by a message of given length was
defined above as the number of independent elementary (binary, equally likely)
selections required, on the average, to specify such a message. The notion
of a minimum number of binary selections required did not enter the definition.
It should be intuitively clear, however, that the minimum number of
binary selections required, on the average, to specify a message is equal
to the average information conveyed, or, in other words, the number of
binary selections becomes a minimum when the selections are equally likely
and independent. To prove this identity, we observe that the amount of
information conveyed by a sequence of independent binary selections is a
maximum when the selections are equally likely. Conversely, therefore, it
is always possible to represent any sequence of m binary, not equally likely
selections with a number of elementary selections smaller, on the average,
than m. It follows that no binary representation of a message can be obtained
with a number of selections smaller than the amount of information
conveyed. It is clear, of course, that all message representations which
employ independent equally likely selections require, on the average, the same
number of selections. It will be shown later that a larger number of
selections is required whenever non-independent selections are used.

Fig. 3 The amount of information per binary selection as a function of the
probability of either choice.

It is appropriate to point out here that the mathematical form of
Eq.(22) suggests a very interesting analogy between information and entropy,
as expressed in statistical mechanics. In fact, $H_N$ appears formally as the
entropy of a system whose possible states have probabilities p(k). For a
physical interpretation of this analogy, the reader is referred to the work
of Norbert Wiener (Ref. 1).

IV. Codes and Code Efficiency 

The preceding sections have been devoted to the definition of the unit
of information and to the computation of the average amount of information
per selection in the case of messages consisting of sequences of independent
N-order selections. It was pointed out in Section III that $H_N$ represents
the minimum number of binary selections required, on the average, to perform
an N-order selection with given choice probabilities. Therefore, if we take
the number of binary selections employed as a basis for comparing different
methods of conveying the same information, $H_N$ represents a theoretical limit
corresponding to maximum efficiency.

The knowledge of such a theoretical limit is extremely important, but
perhaps even more important is the ability to approach this limit in practice.
In our case, fortunately, the procedure followed in computing $H_N$ (that is,
the theoretical limit) indicates a convenient method for approaching this
limit in practice. Let us consider again all the sequences of n N-order
selections (in which, however, n may be a small integer), and arrange them
in order of increasing probability. If we wish to separate any one particular
sequence from the others by means of successive division in almost
equally probable groups, as discussed in the preceding section, the number
of divisions required, on the average, that is, $E(B_s)$, will be larger
than $nH_N$. However, if we increase n, that is, the length of the sequences,
we find that $E(B_s)$/n keeps decreasing and approaches $H_N$ when n approaches
infinity. It must be kept in mind, in this regard, that $E(B_s)$/n does not
decrease necessarily in a monotonic manner, but may have an oscillatory
behavior as a function of n.* It follows that an increase of n may actually
produce an increase of $E(B_s)$/n. For instance (as shown in Figure 4), in the
case of N = 2, p(0) = 0.7, p(1) = 0.3, the value of $E(B_s)$/n is 0.905 for
n = 2, 0.909 for n = 3, and 0.895 for n = 4, the limiting value being
$H_2$ = 0.882.
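
The successive-division procedure can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the procedure used to prepare the figures: the sequences are sorted by decreasing probability, each group is cut where the two parts are most nearly equally probable, and the first such cut is taken when several are equally good. The function names are hypothetical.

```python
from itertools import product
from math import prod

def division_lengths(probs):
    """Number of binary divisions needed to isolate each sequence."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    lengths = [0] * len(probs)

    def divide(group):
        if len(group) < 2:
            return
        total = sum(probs[i] for i in group)
        running, best_cut, best_gap = 0.0, 1, float("inf")
        for cut in range(1, len(group)):
            running += probs[group[cut - 1]]
            gap = abs(2 * running - total)   # |first group - second group|
            if gap < best_gap:               # assumed rule: first best cut wins
                best_cut, best_gap = cut, gap
        for i in group:
            lengths[i] += 1                  # one more binary selection
        divide(group[:best_cut])
        divide(group[best_cut:])

    divide(order)
    return lengths

def binary_selections_per_selection(p, n):
    """E(B_s)/n for sequences of n independent selections."""
    seq_probs = [prod(seq) for seq in product(p, repeat=n)]
    lengths = division_lengths(seq_probs)
    return sum(q * b for q, b in zip(seq_probs, lengths)) / n

for n in (2, 3):
    print(n, round(binary_selections_per_selection([0.7, 0.3], n), 3))
```

For n = 2 and n = 3 this rule gives 0.905 and 0.909. For larger n the result depends on how the doubtful divisions are made, so it need not coincide with a value obtained by a different choice of divisions.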

The above discussion indicates that, in transmitting a message consisting
of a large number of selections, we should transmit the selections not
individually, but in sequences of n as units, the number n being as large as
permitted by practical considerations. The transmission of each of these
units is then performed by means of sequences of binary selections corresponding
in order to the successive divisions of the ensemble of all possible
sequences of n N-order selections, as indicated in Figures 2, 4, and 5. It
will be noted that, although the sequences of binary selections are not equal
in length, it is always possible to identify the end of any of them in a long
message. In fact, the first m selections of any sequence of length larger
than m are always different from any of the sequences consisting of exactly
m selections.
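
The property just stated is what allows a receiver to separate the code sequences without any marker between them. The sketch below illustrates it with a hypothetical code book for pairs of binary selections; the code words, with lengths 1, 2, 3, 3, are assumed for the example and are not taken from the figures.

```python
# Hypothetical code book: pairs of original selections -> binary code words.
CODE = {"00": "0", "01": "10", "10": "110", "11": "111"}

def is_prefix_free(words):
    """True if no code word is the beginning of another code word."""
    return not any(a != b and b.startswith(a) for a in words for b in words)

def encode(pairs):
    """Concatenate the code words with no separators."""
    return "".join(CODE[pair] for pair in pairs)

def decode(digits):
    """Split a stream of binary digits back into the original pairs."""
    inverse = {word: pair for pair, word in CODE.items()}
    pairs, current = [], ""
    for digit in digits:
        current += digit
        if current in inverse:       # end of a code word reached
            pairs.append(inverse[current])
            current = ""
    return pairs

message = ["00", "01", "00", "11", "10", "00"]
stream = encode(message)
recovered = decode(stream)
```

Because no code word begins another, the decoder can emit a pair as soon as the digits read so far match a code word.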

If it is desired to perform the transmission by means of N'-order selections
(N' being any integer), we can proceed in the same manner as in the
case of binary selections, the only difference being that we must divide
successively the ensemble of all possible sequences in N' groups instead of
just two. After each division, the group containing the desired sequence
will then be indicated by means of an N'-order selection.

* This fact was first pointed out to me by L. G. Kraft of this Laboratory.

Fig. 4 Recoding of binary messages for

The operation described above is, effectively, a change of code, that
is, we may say, of the conventional language in which the message is written.
Therefore this operation will be referred to as "message recoding". The
advantage resulting from this recoding is conveniently expressed in terms
of the code efficiency

$$ \eta = \frac{H_N}{\log_2 N} , \tag{25} $$

that is, the ratio of the information transmitted on the average per selection,
to the information which could be transmitted with an equally likely
selection of the same order. The efficiency of a binary code resulting from
the recoding of sequences of N-order selections can be computed most
conveniently in the form

$$ \eta = \frac{n H_N}{E(B_s)} , \tag{26} $$

where n is the number of N-order selections used in the recoding operation.
Note that $nH_N$ is the average amount of information per sequence of n N-order
selections and $E(B_s)$ represents the amount of information which could be
transmitted, on the average, by one of the sequences of binary selections
in which the original sequences are recoded, if these binary selections were
equally likely. If the new code is of N' order, we must substitute for
$E(B_s)$ the product of $\log_2 N'$ by the number of N'-order selections required,
on the average, to specify a sequence of n N-order selections.

A final remark must be made regarding the recoding operation. Since
the process of successive divisions of an ensemble of sequences into equally
probable groups cannot be carried out exactly, it is not clear at times
whether one sequence should be included in one group or in another. Of
course, we wish to perform all divisions in such a way as to obtain at the
end the most efficient code. Unfortunately, no general rule could be found
for determining at once how the divisions should be made in doubtful cases
in order to obtain maximum code efficiency. However, so long as the divisions
are made in a reasonable manner the resulting code efficiency will not
differ appreciably from its maximum value.

We have implicitly assumed in the foregoing discussion that we know
a priori the probabilities p(k) of the choices for a message still to be
transmitted. It seems appropriate at this point to discuss in some detail
this assumption, since the practical value of the results obtained above
depends entirely on its validity. When we state that the probability of a
particular choice has a value p(k) we mean that the frequency of occurrence
of that choice in a message originating from a given source is expected to
be close to p(k). The longer the message, the closer we expect the
frequency to approach p(k). It must be clear, however, that we have no
assurance that the frequency of occurrence will not differ considerably from
the probability even in the case of a very long message, although such a
situation is very unlikely to arise.

In practice, p(k) must be estimated experimentally following the reverse
process, that is, by inference from the measurement of the frequency in a
number of sample messages. If the frequencies in the sample messages are
reasonably alike, or, more precisely, if their values are scattered in the
manner which might be expected on the basis of the length of the messages
used, we may feel relatively safe in taking their average value as a good
estimate of the probability. In other words, we may expect that the frequency
in any other message originating from the same source will be reasonably
close to the average value obtained. If this is the case, the source
of such messages is said to have a stationary statistical character. We can
conceive the case, however, in which the frequencies in the sample messages
available are so widely scattered that hardly any significance can be attributed
to their average value. Such a result may mean that the source has not
a stationary statistical character, at least for practical purposes, in which
case the concept of probability loses any physical significance. Fortunately,
however, the sources of interest appear to have a stationary character for
any practical purpose. In addition, the estimates of the probabilities of
the choices do not need to be too close. It should be clear, in this respect,
that the fact that a code has been designed for a particular set of choice
probabilities does not mean that only messages with the same statistical
character can be transmitted. It means only that such a code will transmit
most efficiently (that is, with the smallest number of selections) messages
with the choice frequencies equal to the assumed probabilities. Moreover,
we can expect that the efficiency of transmission will not depend in a critical
manner on the actual frequencies of the messages to be transmitted. A
proof that this is actually the case is given below.

Suppose that a code which is optimum for a set of choice probabilities
p'(k) is used to transmit messages with choice probabilities p(k). If we
consider again all possible sequences of n selections, the expression for
the number of binary selections required, on the average, to indicate one
particular sequence, $E(B_s)$, is still given by Eq.(13), where, however, the
p(k) which appear in the form $\log_2$ p(k) should be changed into p'(k). It
follows that, in the limit when n approaches infinity, the number of binary
selections per N-order selection will approach, according to Eq.(22), the
value

$$ H'_N = -\sum_{k=0}^{N-1} p(k) \log_2 p'(k) . \tag{27} $$

It is clear from this equation that $H'_N$ varies rather slowly with any one of
the p'(k), unless the corresponding p(k) is close to zero or unity. $H'_N$ is,
of course, a minimum when p'(k) = p(k). The case of N = 2 is illustrated in
Figure 6 for p(0) = 0.5 and p(0) = 0.7. We may conclude, therefore, that
the statistical characteristics assumed a priori can be rather different from
those of the messages actually transmitted, without the efficiency being
lowered too much.
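
The slow variation of Eq.(27) with the assumed probabilities can be seen numerically. The sketch below evaluates the limiting number of binary selections per selection for a binary source with p(0) = 0.7 and several assumed values p'(0). The ratio of $H_N$ to $H'_N$ is used here as a measure of the resulting efficiency; that ratio, the function names and the sample values are assumptions made for this illustration.

```python
from math import log2

def selections_with_assumed_probabilities(p, p_assumed):
    """H'_N of Eq.(27): -sum_k p(k) log2 p'(k)."""
    return -sum(pk * log2(qk) for pk, qk in zip(p, p_assumed) if pk > 0)

def information_per_selection(p):
    """H_N of Eq.(22), obtained from Eq.(27) with p'(k) = p(k)."""
    return selections_with_assumed_probabilities(p, p)

def limiting_efficiency(p, p_assumed):
    """Ratio H_N / H'_N; equal to unity when the assumption is exact."""
    return (information_per_selection(p)
            / selections_with_assumed_probabilities(p, p_assumed))

actual = [0.7, 0.3]
for assumed_p0 in (0.5, 0.6, 0.7, 0.8):
    assumed = [assumed_p0, 1 - assumed_p0]
    print(assumed_p0, round(limiting_efficiency(actual, assumed), 3))
```

The ratio stays above 0.88 over this whole range of assumed probabilities, which is the sense in which the efficiency is not lowered too much.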



Fig. 6 Behavior of $H'_N$ as a function of p'(0) for binary messages.


V. The Case of Non-Independent Selections 


Thus far we have been considering only messages of a particularly
simple type, namely, messages consisting of sequences of independent selections.
Obviously, the statistical character of most practical messages is
much more complex. Any particular selection depends generally on a number
of preceding selections. For instance, in a written message the probability
that a certain letter will be an "h" is highest when the preceding letter
is a "t". In a television signal the light intensity of a certain element
of a scanning line depends very strongly on the light intensities of the
corresponding elements in the preceding lines and in the preceding frames.
In fact, the light intensity is very likely to be almost uniform over wide
regions of the picture and to remain unchanged for several successive frames.

The simplifying assumption that any one selection is independent of the
preceding selections, although quite unrealistic, does not invalidate
completely the results obtained in the preceding sections, but merely reduces
their significance to that of first approximations. Intuitively, the average
amount of information conveyed by a sequence of given length is decreased
by the a priori knowledge of any correlation existing between successive
selections. Therefore, the value given by Eq.(22) will always be larger
than the correct value for the average amount of information per selection,
and the same is true of the code efficiency given by Eq.(25). Similarly,
any recoding operation performed in the manner discussed in Section IV will
result in a higher efficiency of transmission, but not so high as could be
obtained by taking into account the correlation between successive selections.

The procedure for computing the average amount of information per selection
and for recoding messages is still essentially the same as that used in
Sections III and IV, even when the dependence of any selection on the preceding
selections is taken into account. The only difference is that the
probability of a particular sequence will not be equal simply to the product
of the probabilities of the choices in it, since these are no longer independent.
We must still arrange all the possible sequences of given length
n in order of probability, and separate the desired sequence by successive
divisions of the ensemble of sequences in groups as equally probable as
possible. The number of divisions required, on the average, divided by the
number n of selections will approach $H_N$ when n approaches infinity.

Let $P_n(i)$ be the probability of the i-th sequence of n selections, and
$H_s(n)$ the average amount of information per sequence of n selections when
successive sequences are assumed to be independent. We have then

$$ H_s(n) = -\sum_{i=0}^{N^n-1} P_n(i) \log_2 P_n(i) . \tag{28} $$

Let us consider next a sequence of n + 1 selections and let $P_{n+1}(i;k)$ be the
conditional probability that the i-th sequence (of the $N^n$ sequences of n
selections) is followed by the k-th choice (of the N). We have then

$$ H_s(n+1) = -\sum_{k=0}^{N-1} \sum_{i=0}^{N^n-1} P_n(i) P_{n+1}(i;k) \log_2 [P_n(i) P_{n+1}(i;k)] , \tag{29} $$

which, since

$$ \sum_{k=0}^{N-1} P_{n+1}(i;k) = 1 , \tag{30} $$

becomes

$$ H_s(n+1) = H_s(n) - \sum_{k=0}^{N-1} \sum_{i=0}^{N^n-1} P_n(i) P_{n+1}(i;k) \log_2 P_{n+1}(i;k) . \tag{31} $$

The increment of information resulting from the (n+1)-th selection is then,
on the average,

$$ H_N(n+1) = -\sum_{k=0}^{N-1} \sum_{i=0}^{N^n-1} P_n(i) P_{n+1}(i;k) \log_2 P_{n+1}(i;k) . \tag{32} $$

Expressing now $H_s(n)$ in terms of the successive increments, we obtain

$$ H_s(n) = \sum_{m=1}^{n} H_N(m) . \tag{33} $$

The final correct value of the average amount of information per selection
can then be written in the form

$$ H_N = \lim_{n \to \infty} (1/n) \sum_{m=1}^{n} H_N(m) . \tag{34} $$
To proceed further in our analysis, we must distinguish between two
types of statistical character of practical importance. We shall say that
the output of a certain source is statistically uniform if each and any
selection depends in the same manner on the preceding selection, as seems
to be the case in a written message. We shall say that the output is periodically
discontinuous if it is possible to divide any output sequence in
sub-sequences of fixed and equal length, so that each and any selection
depends in the same manner on the preceding selection of the same sub-sequence
but is independent of all selections of the preceding sub-sequences.
This is the case when messages transmitted in succession are similar in
character and equal in length but entirely unrelated to one another, as, for
example, in facsimile transmission. The above differentiation of statistical
character is not an exhaustive classification but only a characterization of
two special cases of practical interest in which different results are obtained.

Considering now in more detail the increments of information $H_N(n+1)$,
our intuition indicates that the average amount of information conveyed by
any additional selection can be, at most, equal to the value obtained when
the selection is independent of all preceding selections. Mathematically,
it must be

$$ H_N(n+1) \le H_N(1) = -\sum_{k=0}^{N-1} p(k) \log_2 p(k) . \tag{35} $$

A proof of this inequality is given in Appendix II. In addition, it is
intuitively clear also that, in the case of uniform statistical character,
the average amount of information conveyed by the (n+1)-th selection of a
sequence can be, at most, equal to the amount of information conveyed by the
n-th selection, since the latter has fewer preceding selections on which to
depend. Mathematically, we expect that, for statistically uniform sequences,

$$ H_N(n+1) \le H_N(n) . \tag{36} $$

A proof of this inequality is given also in Appendix II. Eq.(36) is satisfied
with the equal sign when the (n+1)-th selection, and therefore any following
selection, depends only on the n-1 preceding selections.

Eq.(36) shows that the limit in Eq.(34) is approached in a monotonic
manner. In addition, we expect $H_N(m)$ to approach monotonically a limit with
increasing m, since the dependence of any selection on the preceding selections
cannot extend, in practice, over an indefinitely large number of selections.
Suppose, for instance, that this dependence extends only over the
$n_0 - 1$ preceding selections. Then $H_N(m)$ becomes constant and equal to $H_N(n_0)$
when m is larger than $n_0$, and Eq.(34) yields

$$ H_N = H_N(n_0) . \tag{37} $$

This result is correct, of course, only in the case of statistically uniform
sequences.
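
The behavior of the increments $H_N(m)$ can be illustrated with a statistically uniform binary source in which each selection depends only on the one preceding it, so that $n_0$ = 2. The sketch below evaluates Eq.(28) by direct enumeration and obtains the increments as differences, following Eq.(33). The source model, its transition probabilities and the function names are assumptions made for this example.

```python
from itertools import product
from math import log2

# Assumed source: TRANSITION[a][b] is the probability that choice a is
# followed by choice b; INITIAL is the matching stationary distribution.
TRANSITION = [[0.9, 0.1], [0.3, 0.7]]
INITIAL = [0.75, 0.25]

def sequence_probability(seq):
    """P_n(i) for one particular sequence of selections."""
    prob = INITIAL[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= TRANSITION[prev][cur]
    return prob

def information_per_sequence(n):
    """H_s(n) of Eq.(28), by enumeration of all N**n sequences."""
    total = 0.0
    for seq in product(range(len(INITIAL)), repeat=n):
        p = sequence_probability(seq)
        if p > 0:
            total -= p * log2(p)
    return total

def increments(n_max):
    """H_N(1), ..., H_N(n_max), each the difference H_s(m) - H_s(m-1)."""
    blocks = [0.0] + [information_per_sequence(m) for m in range(1, n_max + 1)]
    return [blocks[m] - blocks[m - 1] for m in range(1, n_max + 1)]

H = increments(4)
print([round(h, 4) for h in H])
```

The first increment is the value given by Eq.(22) for independent selections; the later ones are smaller and equal to one another, so that the limit of Eq.(34) is $H_N(2)$ for this source.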

In the case of a periodically discontinuous statistical character,
Eq.(36) is valid only when the n-th and the (n+1)-th selections belong to the
same sub-sequence. If this is not the case, the (n+1)-th selection must be
the first selection of a sub-sequence, and therefore is independent of all
preceding selections. It follows that $H_N(m)$ is a periodic function of m with
period equal to the length $n_1$ of the sub-sequences, and that the limit of
Eq.(34) is approached in an oscillatory manner. If we compute this limit by
increasing n in steps equal to $n_1$, it is easily seen that Eq.(34) yields

$$ H_N = (1/n_1) \sum_{m=1}^{n_1} H_N(m) , \tag{38} $$

a value larger than that given by Eq.(37), as was expected.

The recoding procedure in the case of messages consisting of non-independent
selections is still the same as in the case of independent selections.
The efficiency of transmission, still given by Eq.(25), increases (although not
necessarily monotonically) with the number of selections used as units in the
recoding process, and approaches unity when the number increases indefinitely.
It is worth emphasizing that in the recoding process any sequence, even if
statistically uniform, is considered as periodically discontinuous. In fact,
the groups of selections recoded as units are effectively sub-sequences
which are treated as though they were totally unrelated. It follows that,
if the recoding operation of a statistically uniform sequence is performed
on groups of $n_0$ selections, the efficiency of transmission after recoding
can be at most equal to

$$ \frac{n_0 H_N(n_0)}{\sum_{m=1}^{n_0} H_N(m)} . \tag{39} $$
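
The bound of Eq.(39) is easily evaluated once the increments $H_N(m)$ are known. The following sketch does so for an assumed pair of increments, $H_N(1)$ = 0.811 and $H_N(2)$ = 0.572, chosen only as an illustration of a source whose dependence extends over one preceding selection; the function name is hypothetical.

```python
def recoding_efficiency_bound(increments):
    """Upper bound of Eq.(39) for recoding in groups of n_0 selections.

    `increments` lists H_N(1), ..., H_N(n_0); the last entry is taken as
    the limiting value H_N, as in Eq.(37).
    """
    n0 = len(increments)
    return n0 * increments[-1] / sum(increments)

# Assumed increments for illustration.
bound = recoding_efficiency_bound([0.811, 0.572])
print(round(bound, 3))
```

The bound falls short of unity because the first selection of each group is recoded as though nothing were known about the selections preceding it.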


In the case of statistically discontinuous sequences, it would seem
reasonable to make the number of selections in the recoding groups an
integral fraction or multiple of the length of the sub-sequences.

A final remark is in order regarding the fitting of the recoding procedure
to the statistical character of the messages to be transmitted. It
may happen, as it does in the case of television signals, that the dependence
of any one selection on the m-th preceding selection does not decrease
monotonically when m increases, but behaves in an oscillatory manner. In
this case, one should first reorder the selections before recoding, in such
a manner that selections which are closely related take positions close to
one another in the sequence. This idea of reordering the selections in the
sequence can be generalized as follows. Any type of transmission of information
can be considered as the transmission, in succession, of patterns in
a two-dimensional or multi-dimensional space, time being one of the dimensions.
Then the problem of ordering selections in an appropriate manner
can be generalized to the problem of how best to scan these patterns. It
is clear, on the other hand, that such a scanning problem is also at the
root of the problem of reducing the bandwidth required by television signals.
The generalized scanning problem seems to be, therefore, of fundamental
practical, as well as theoretical, importance. However, no work can yet be
reported on this subject.


VI. Practical Considerations 

The main purpose of this paper was to provide a logical basis for the
measurement of the rate of transmission of information. It has been shown
that an appropriate measure for the rate of transmission in the case of
sequences of selections can be provided by the minimum number of binary
selections required, on the average, to indicate one of the original selections.
We were then led naturally to consider the problem of actually performing
the transmission of the original sequences by means of as few binary
or higher-order selections as possible. We did not consider, however, the
physical process corresponding to such selections, that is, their transmission
by electrical means.

A convenient way of transmitting binary selections in a practical
communication system is by means of pulses with two possible levels, one and
zero. This is just the technique employed in pulse-code modulation. The
maximum rate at which information can be transmitted in this case is simply
equal to the number of pulses per second which can be handled by the electrical
system, which we know to be proportional to the frequency band
available. However, as soon as we start dealing with electrical pulses
rather than logical operations like selections, an additional item must be
considered in the problem, namely, the power required for the transmission.
In the case of two-level pulses, the average power corresponding to the maximum
rate of transmission of information is equal to one-half the pulse power,
since the zero and one levels are equally probable.

If pulses with N rather than two levels equally spaced in voltage are
used, the maximum rate of transmission is equal to $\log_2 N$ times the number of
pulses per second which can be handled by the system. The average power
required becomes, in this case,

$$ W = \frac{W_0}{N} \sum_{k=0}^{N-1} k^2 = \frac{(N-1)(2N-1)}{6} W_0 , \tag{40} $$

where $W_0$ is the power corresponding to the lowest (non-zero) voltage level.
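
The trade between the number of levels and the average power can be tabulated directly. The sketch below assumes, as above, N equally probable levels equally spaced in voltage, so that the k-th level carries a power $k^2 W_0$; the function names and the choice $W_0$ = 1 are assumptions made for the illustration.

```python
from math import log2

def average_power(levels, w0=1.0):
    """Average power for equally probable, equally spaced pulse levels."""
    return w0 * sum(k * k for k in range(levels)) / levels

def information_per_pulse(levels):
    """Maximum information per pulse with equally probable levels."""
    return log2(levels)

for levels in (2, 4, 8, 16):
    print(levels, information_per_pulse(levels), average_power(levels))
```

Doubling the number of levels adds one unit of information per pulse while multiplying the average power by roughly four.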

The theoretical limit stated above for the rate of transmission of
information certainly has practical significance when the limiting factors
in the physical problem are the frequency band available and the number of
pulse levels permitted by technical and economical considerations. It is
to be noted, in this regard, that the effect of noise is here taken into
account, to a first approximation, by setting a lower limit to the voltage
difference between pulse levels, and therefore to $W_0$. For a detailed discussion
of the effect of noise, the reader is referred to the work of
Shannon (5).

Eq.(40) shows, on the other hand, that the average power increases
approximately as $N^2$, while the rate of transmission is proportional only to
$\log_2 N$. It follows that, if no limitation is placed on the frequency band
employed, the smallest value of N should be used, that is, two. This value
has, in addition, the very important practical advantage that the receiver
is not required to measure a pulse, but only to detect the existence or the
lack of a pulse. It might happen, on the other hand, that the frequency
band and the average power are the limiting factors, while any reasonable
number of pulse levels can be allowed. This case represents quite a different
problem from those considered above, and the maximum rate of transmission
of information is no longer obtained by making the pulse levels
(that is, the choices) equally probable, as one might think at first. For
example, more than one unit of information per pulse can be transmitted with
an average power W = $W_0$/2, by using pulses with three levels not equally
probable. It seems worth while, therefore, to determine the maximum amount
of information which can be transmitted per pulse, for a given average power
W, a minimum level power $W_0$, and an unlimited number of pulse levels equally
spaced in voltage.

Let, therefore, p(0) be the probability of occurrence of the zero level
(no pulse), and p(k) the probability of the k-th level. The amount of information
per pulse is given by

$$ H = -\sum_{k=0}^{\infty} p(k) \log_2 p(k) \tag{41} $$

and the average power by

$$ W = W_0 \sum_{k=0}^{\infty} k^2 p(k) . \tag{42} $$

We wish to maximize H with respect to the p(k), subject to the condition
expressed by Eq.(42) and, of course, the usual condition

$$ \sum_{k=0}^{\infty} p(k) = 1 . \tag{43} $$

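The maximization procedure is carried out in Appendix III, and yields

$$ p(k) = p(0) [p(1)/p(0)]^{k^2} , \tag{44} $$

$$ H_{\max} = -\log_2 p(0) - (W/W_0) \log_2 [p(1)/p(0)] . \tag{45} $$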


The values of p(1)/p(0) and p(0) are plotted in Figure 7 as functions
of W/$W_0$. The value of $H_{\max}$ is plotted as a function of the same variable
in Figure 8. The latter curve shows, for instance, that the maximum amount
of information per pulse for W = $W_0$/2 is 1.14, that is, 14 per cent higher
than the value obtained by using two equally probable levels.
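
The maximum can also be located numerically. The sketch below assumes that the maximizing probabilities have the form p(k) proportional to $r^{k^2}$, with r = p(1)/p(0), which is what Lagrange's method gives for the conditions of Eqs.(42) and (43); it then adjusts r by bisection until the average power has the prescribed value. The function names, the truncation of the sums at 60 levels and the bisection limits are assumptions, adequate only for moderate values of W/$W_0$.

```python
from math import log2

def level_statistics(r, terms=60):
    """Return (p(0), W/W_0) for p(k) proportional to r ** (k * k)."""
    weights = [r ** (k * k) for k in range(terms)]
    z = sum(weights)
    power_ratio = sum(k * k * w for k, w in enumerate(weights)) / z
    return 1.0 / z, power_ratio

def max_information_per_pulse(power_ratio, tol=1e-12):
    """Bisect on r until W/W_0 matches, then evaluate the information."""
    low, high = 0.0, 0.999
    while high - low > tol:
        mid = (low + high) / 2
        if level_statistics(mid)[1] < power_ratio:
            low = mid           # too little power: raise r
        else:
            high = mid
    r = (low + high) / 2
    p0, _ = level_statistics(r)
    information = -log2(p0) - power_ratio * log2(r)
    return information, p0, r

h_max, p0, r = max_information_per_pulse(0.5)
print(h_max, p0, r)
```

For W = $W_0$/2 the computed value exceeds one unit per pulse and lies close to the value quoted above from Figure 8.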

The procedure for approaching in practice the theoretical limit obtained
above by appropriate recoding of the messages is very similar to that discussed
in Section IV. It differs only in that the ensemble of all sequences
of given length must now be divided in groups with probabilities p(0), p(1), ...,
p(k), ..., instead of in equally probable groups. The number of pulse levels
to be used in practice (it should be infinite in theory) must be selected
on a compromise basis, and the values of the p(k) must then be readjusted
accordingly to make

$$ \sum_{k=0}^{N-1} p(k) = 1 . $$

Fig. 7 Behavior of p(1)/p(0) and p(0) as functions of W/$W_0$.

Fig. 8 Maximum information per pulse, $H_{\max}$, as a function of W/$W_0$.


In addition to the effect of limitations on the average power, another
important practical consideration has been neglected in the preceding sections.
All the types of recoding procedures suggested, for approaching in
practice the theoretical limits derived above, require the use of devices
capable of storing the information for a certain length of time in both the
transmitter and the receiver. Such storage devices are needed to stretch
or compress the time scale according to the probability of the group of
original selections being recoded for transmission.

Satisfactory storage units are not yet available. In addition, even
were they available, their use would undoubtedly add considerably to the
complexity of communication systems. On the other hand, any substantial
increase of transmission efficiency is fundamentally based on time stretching.
In fact, since the logarithm of the probability of the choice or
sequence of choices selected is a measure of the information conveyed by
the selection (see p. 14), the time rate at which information is conveyed
in actual signals may vary considerably with time. Even so, a communication
system must be able to handle at any time the peak rate which may be present
in the signal. It follows that any system not employing storage devices to
stretch or compress the time scale is bound to have an efficiency lower than
the ratio of the average rate to the peak rate at which information is fed
to it. It is worth mentioning in this connection that in certain types of
communications, such as telegraph and television, the input and output signals
do not have inherently fixed time scales. This is the same as saying
that such forms of communication inherently incorporate storage devices.
In the case of the telegraph, the written messages at the input and at the
output are effectively storage devices. In the case of television, the
image to be televised and the cathode-ray tube perform the same function.

Although no reduction of frequency band for a given noise level can
be obtained without storage devices, appropriate coding may lead to some
reduction of average power. This reduction can be obtained by assigning
sequences of pulses requiring the smallest energy to the most probable
messages, and vice versa. In the particular case of pulse-code modulation,
for instance, this can be done as follows. We arrange all digit combinations
in order of increasing amount of energy required and the sampling levels in
order of decreasing probability. We then assign the digit combinations to
the sampling levels in the resulting order. Such a coding method requires,
however, more flexible coding and decoding units than those used in
present-day systems.

Before concluding this section, it should be made clear that the 
improvement of transmission efficiency discussed above and the resulting 
possible reduction of bandwidth requirements for a given signal power have 
little to do with the bandwidth reduction obtained by means of the Vocoder 
or other similar schemes. The Vocoder (2), for instance, does not improve 
the efficiency of transmission, but achieves a reduction in bandwidth by 
eliminating that part of the speech signal which is not strictly necessary 
for the mere understanding of the words spoken. Obviously, the recoding 
of messages according to their statistical character and the elimination 
of unnecessary information represent fundamentally different but equally 
important contributions to the solution of the bandwidth-reduction problem. 

Appendix I 

Maximization of f(x) 

In determining the values of the $x_k$ for which $f(x)$, as given in Eq. (18), 
is a maximum, it is more convenient to operate on the function 

$$\varphi(x) = \ln f(x) \tag{I-1}$$

whose maxima and minima at non-singular points coincide with those of $f(x)$. 
The $x_k$ are the variables in the maximization process, but are subject to the 
constraint 

$$\sum_{k=0}^{N-1} x_k = 1 . \tag{I-2}$$


Using Lagrange's method, we equate to zero the partial derivatives, with 
respect to the $x_k$, of the function 

$$\varphi(x) + \lambda \sum_{k=0}^{N-1} x_k , \tag{I-3}$$

where $\lambda$ is a constant to be determined later. We obtain then $N$ equations 
of the form 

$$n \left[ \ln p_k - (1 + \ln x_k) \right] - \frac{1}{2 x_k} + \lambda = 0 . \tag{I-4}$$


It is clear that when $n$ approaches infinity these equations can be satisfied 
simultaneously only when $x_k = p_k$, in which case Eq. (I-2) is also satisfied. 
In addition, the function $f(x)$ is neither discontinuous nor a minimum 
at the point $x_k = p_k$, so that the existence of a maximum at this point does 
not require any further mathematical proof. 
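
A numerical check of this result is sketched below. Eq. (18) is not reproduced in this appendix, so the sketch assumes that $f(x)$ is proportional to the multinomial probability of observing the relative frequencies $x_k$ in $n$ independent selections; the function names and the probabilities are hypothetical choices made for the illustration.

```python
from math import lgamma, log

def log_multinomial(counts, p):
    """Natural log of the multinomial probability of the given counts."""
    n = sum(counts)
    out = lgamma(n + 1)
    for c, pk in zip(counts, p):
        out += c * log(pk) - lgamma(c + 1)
    return out

def most_probable_frequencies(p, n):
    """Exhaustive search over all count triples (N = 3 choices only)."""
    best, best_val = None, float("-inf")
    for a in range(n + 1):
        for b in range(n + 1 - a):
            counts = (a, b, n - a - b)
            val = log_multinomial(counts, p)
            if val > best_val:
                best, best_val = counts, val
    return tuple(c / n for c in best)

p = (0.5, 0.3, 0.2)   # assumed probabilities of the three choices
x = most_probable_frequencies(p, 200)
print("most probable frequencies:", x)
print("choice probabilities     :", p)
```

Under this assumption the most probable relative frequencies lie close to the $p_k$ for large $n$, in agreement with the statement above.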


Maximization of $H_N$ 

The function given by Eq. (22) must be maximized with respect to the 
$p(k)$ which are, of course, subject to the constraint 

$$\sum_{k=0}^{N-1} p(k) = 1 . \tag{I-5}$$


Following the same method as above, we obtain $N$ equations of the form 

$$\frac{\partial}{\partial p(k)} \left[ H_N + \lambda \sum_{k=0}^{N-1} p(k) \right]
  = -\frac{1}{\ln 2} \left[ 1 + \ln p(k) \right] + \lambda = 0 . \tag{I-6}$$

This set of equations can be satisfied only if all the $p(k)$ are equal. 
Again it is clear that $H_N$ is neither discontinuous nor a minimum when all 
the $p(k)$ are equal, and therefore it must be a maximum. 
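
The conclusion can be checked numerically. The sketch below assumes that the function of Eq. (22) is the average information $-\sum_k p(k) \log_2 p(k)$, as in Eq. (II-9); the number of choices, the random trial distributions and the function names are hypothetical.

```python
import random
from math import log2

def entropy(p):
    """Average information per selection, in binary units."""
    return -sum(pk * log2(pk) for pk in p if pk > 0)

def random_distribution(size, rng):
    """A random set of probabilities that sums to one."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
N = 5
uniform = [1.0 / N] * N
trials = [entropy(random_distribution(N, rng)) for _ in range(1000)]

print("equal probabilities :", entropy(uniform))
print("best random trial   :", max(trials))
```

No trial distribution exceeds the value obtained with equal probabilities, which is $\log_2 N$.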
Appendix II 


Proof That $H_N(n+1) \le H_N(1)$ 

We wish to show, first, that the increment of the amount of information 

$$H_N(n+1) = -\sum_{k=0}^{N-1} \sum_{i=0}^{N^n-1} P_n(i)\, P_{n+1}(i;k) \log_2 P_{n+1}(i;k) \tag{II-1}$$

is a maximum when $P_{n+1}(i;k) = p(k)$, the probability of the $k$th choice, that 
is, when the additional selection is independent of all preceding selections. 
Mathematically, we must maximize the function $H_N(n+1)$ with respect to the 
$N^{n+1}$ variables $P_{n+1}(i;k)$, subject to the conditions 


$$\sum_{i=0}^{N^n-1} P_n(i)\, P_{n+1}(i;k) = p(k) \tag{II-2}$$

and 

$$\sum_{k=0}^{N-1} P_{n+1}(i;k) = 1 . \tag{II-3}$$


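Following Lagrange's method, we equate to zero the derivatives with 
respect to the $P_{n+1}(i;k)$ of the function 

$$H_N(n+1) + \sum_{i=0}^{N^n-1} \lambda_i \sum_{k=0}^{N-1} P_{n+1}(i;k)
  + \sum_{k=0}^{N-1} \mu_k \sum_{i=0}^{N^n-1} P_n(i)\, P_{n+1}(i;k) , \tag{II-4}$$

where the $\lambda_i$ and $\mu_k$ are constants to be determined later. We obtain then, 
for each pair of values of $i$ and $k$, an equation of the form (the constant 
factor $1/\ln 2$ being absorbed into the multipliers) 

$$P_n(i) \left[ 1 + \ln P_{n+1}(i;k) \right] - \lambda_i - \mu_k P_n(i) = 0 . \tag{II-5}$$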

The solution of the $N^{n+1}$ equations of this type, together with Eqs. (II-2) 
and (II-3), is clearly 

$$\lambda_i = P_n(i) , \tag{II-6}$$

$$\mu_k = \ln P_{n+1}(i;k) = \ln p(k) . \tag{II-7}$$

Therefore, the increment of information is a maximum for $P_{n+1}(i;k) = p(k)$, 
since this is the only point at which a maximum can exist and a maximum must 
exist at some point. This result can also be stated in the form 
$$H_N(n+1) \le H_N(1) , \tag{II-8}$$

where 

$$H_N(1) = -\sum_{k=0}^{N-1} p(k) \log_2 p(k) \tag{II-9}$$

is the average amount of information per selection, that is, the average 
increment of information, when each selection is independent of all 
preceding selections. 
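
The inequality can be illustrated numerically with the quantities that appear in Eqs. (II-1), (II-2) and (II-9). The sketch below draws an arbitrary dependence of the next selection on the preceding sequence; the values of $N$ and $n$, the random probabilities and the function names are hypothetical choices made for the example.

```python
import random
from math import log2

def random_distribution(size, rng):
    """A random set of strictly positive probabilities that sums to one."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def increment(P_seq, P_cond):
    """Right-hand side of Eq. (II-1): information added by the next selection."""
    return -sum(P_seq[i] * q * log2(q)
                for i in range(len(P_seq)) for q in P_cond[i] if q > 0)

def marginal(P_seq, P_cond):
    """p(k) obtained from Eq. (II-2)."""
    n_choices = len(P_cond[0])
    return [sum(P_seq[i] * P_cond[i][k] for i in range(len(P_seq)))
            for k in range(n_choices)]

def entropy(p):
    """H_N(1) of Eq. (II-9)."""
    return -sum(q * log2(q) for q in p if q > 0)

rng = random.Random(1)
N, n = 3, 2
M = N ** n                                    # number of preceding sequences
P_seq = random_distribution(M, rng)           # P_n(i)
P_cond = [random_distribution(N, rng) for _ in range(M)]   # P_{n+1}(i;k)

p = marginal(P_seq, P_cond)
dependent = increment(P_seq, P_cond)          # selection depends on the past
independent = increment(P_seq, [p] * M)       # P_{n+1}(i;k) = p(k)

print("dependent   :", dependent)
print("independent :", independent)
print("H_N(1)      :", entropy(p))
```

The dependent increment does not exceed $H_N(1)$, and the independent case attains it.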

Proof That $H_N(n+1) \le H_N(n)$ 

Let us consider a sequence of $n$ selections as consisting of a first 
selection followed by a sequence of $n-1$ selections. Let $P_n(h;j)$ be the 
conditional probability that the selection of the $h$th choice is followed by 
the selection of the $j$th sequence from the $N^{n-1}$ possible sequences of $n-1$ 
selections. Let also $P_{n+1}(h,j;k)$ be the conditional probability that the 
$k$th choice is selected after the $h$th choice and the $j$th sequence. We shall 
still indicate with $p(k)$ the probability of the $k$th choice and, similarly, 
with $p(h)$ the probability of the $h$th choice. Using these new symbols, 
Eq. (II-1) becomes 

$$H_N(n+1) = -\sum_{h=0}^{N-1} \sum_{j=0}^{N^{n-1}-1} \sum_{k=0}^{N-1}
  p(h)\, P_n(h;j)\, P_{n+1}(h,j;k) \log_2 P_{n+1}(h,j;k) . \tag{II-10}$$


We wish to show that, for a statistically uniform sequence, $H_N(n+1)$ is a 
maximum when $P_{n+1}(h,j;k)$ is independent of $h$. Mathematically, we must again 
maximize the function $H_N(n+1)$ with respect to the $N^{n+1}$ variables 
$P_{n+1}(h,j;k)$, subject to the conditions 

$$\sum_{k=0}^{N-1} P_{n+1}(h,j;k) = 1 \tag{II-11}$$

and 

$$\sum_{h=0}^{N-1} p(h)\, P_n(h;j)\, P_{n+1}(h,j;k) = P_{n-1}(j)\, P_n(j;k) , \tag{II-12}$$

where $P_{n-1}(j)$ is the probability of the $j$th sequence of $n-1$ selections, and 
$P_n(j;k)$ is the conditional probability that the $k$th choice will be selected 
after the $j$th sequence. These two probabilities must, in turn, satisfy the 
condition 

$$\sum_{j=0}^{N^{n-1}-1} P_{n-1}(j)\, P_n(j;k) = p(k) , \tag{II-13}$$

which, however, does not concern us, since it does not involve directly the 
$P_{n+1}(h,j;k)$. It must be clear, on the other hand, that the $P_n(j;k)$ are kept 
constant in the maximization process. In other words, the dependence of the 
$(n+1)$th selection on the $n-1$ preceding selections is fixed in this case, while 
in the case discussed previously it was allowed to vary. In addition, since 
we are dealing with a statistically uniform sequence, the $(n+1)$th selection 
depends on the $n-1$ preceding selections in the same manner as the $n$th 
selection depends on its $n-1$ preceding selections, that is, on all the 
preceding selections. 

Proceeding in the same manner as in the proof that $H_N(n+1) \le H_N(1)$, 
we find that, for given $P_n(j;k)$, the $P_{n+1}(h,j;k)$ make $H_N(n+1)$ a maximum 
when they are independent of $h$, that is, of the first selection of the 
sequence. Mathematically speaking, the maximum occurs when $P_{n+1}(h,j;k) = 
P_n(j;k)$. It follows that Eq. (II-10) yields, with the help of Eq. (II-11), 

$$\left[ H_N(n+1) \right]_{\max} = -\sum_{j=0}^{N^{n-1}-1} \sum_{k=0}^{N-1}
  P_{n-1}(j)\, P_n(j;k) \log_2 P_n(j;k) = H_N(n) . \tag{II-14}$$

This result can also be stated in the form 

$$H_N(n+1) \le H_N(n) . \tag{II-15}$$

It must be clear that, in the case of non-statistically uniform sequences, 
$P_n(j;k)$ may be an entirely different function than that representing the 
dependence of the $n$th selection on the first $n-1$ selections of the sequence, 
since, for instance, the $(n+1)$th selection can be entirely independent of 
the preceding selections while the $n$th selection is not. It follows, in this 
latter case, that Eq. (II-14) is not valid, and $H_N(n+1)$ can be as large as 
$H_N(1)$. 
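
For a statistically uniform sequence, the successive increments $H_N(1), H_N(2), \ldots$ can be computed by brute force and seen to be non-increasing. The sketch below takes "statistically uniform" to mean stationary, and builds such a sequence as a two-valued function of a three-state Markov chain started in its stationary distribution. The transition matrix, the output map and all function names are hypothetical choices made for the illustration; the increment $H_N(n)$ is obtained as the difference between the total information of sequences of $n$ and of $n-1$ selections.

```python
from itertools import product
from math import log2

# Hypothetical three-state chain; each row sums to one.
T = [[0.80, 0.15, 0.05],
     [0.10, 0.70, 0.20],
     [0.30, 0.30, 0.40]]
EMIT = [0, 0, 1]   # observed choice (N = 2) produced by each hidden state

def stationary(T, iters=2000):
    """Stationary distribution of the chain, by repeated multiplication."""
    size = len(T)
    pi = [1.0 / size] * size
    for _ in range(iters):
        pi = [sum(pi[s] * T[s][t] for s in range(size)) for t in range(size)]
    return pi

def sequence_probability(seq, pi):
    """Probability of one observed sequence of selections."""
    size = len(T)
    alpha = [pi[s] if EMIT[s] == seq[0] else 0.0 for s in range(size)]
    for x in seq[1:]:
        alpha = [sum(alpha[s] * T[s][t] for s in range(size)) if EMIT[t] == x else 0.0
                 for t in range(size)]
    return sum(alpha)

def total_information(n, pi):
    """Average information of a whole sequence of n selections."""
    total = 0.0
    for seq in product((0, 1), repeat=n):
        p = sequence_probability(seq, pi)
        if p > 0:
            total -= p * log2(p)
    return total

pi = stationary(T)
totals = [0.0] + [total_information(n, pi) for n in range(1, 7)]
H = [totals[n] - totals[n - 1] for n in range(1, 7)]   # H[0] is H_N(1)

for n, value in enumerate(H, start=1):
    print(f"H_N({n}) = {value:.6f}")
```

Each printed increment is no larger than the one before it, as Eq. (II-15) requires for a stationary sequence.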



Appendix III 

We wish to maximize the average amount of information per pulse, H, for 
a given average power and an unlimited number of pulse levels equally spaced 
in voltage. Mathematically, this amounts to maximizing the function given 
by Eq. (41), subject to the conditions imposed by Eqs. (42) and (43). Following 
Lagrange's method, as in Appendices I and II, we obtain an infinite set 
of equations of the form 

$$1 + \ln p(k) = \lambda + k^2 \mu , \tag{III-1}$$


where $\lambda$ and $\mu$ are indeterminate constants. The first of these constants, 
$\lambda$, can be eliminated by subtracting the equation with $k = 0$ from all the 
other equations of the set, which then take the form 

$$\ln p(k) - \ln p(0) = k^2 \mu . \tag{III-2}$$

The remaining constant, $\mu$, is then eliminated by subtracting $k^2$ times 
Eq. (III-2) with $k = 1$ from the other equations of the same set. We obtain 
in this manner a set of equations of the form 

$$\left[ \ln p(k) - \ln p(0) \right] - k^2 \left[ \ln p(1) - \ln p(0) \right] = 0 . \tag{III-3}$$

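It follows that 

$$\frac{p(k)}{p(0)} = \left[ \frac{p(1)}{p(0)} \right]^{k^2} . \tag{III-4}$$

Eqs. (42) and (43) can now be written in the forms 

$$p(0) \sum_{k=1}^{\infty} k^2 \left[ \frac{p(1)}{p(0)} \right]^{k^2} = \frac{W}{W_0} \tag{III-5}$$

and 

$$p(0) \sum_{k=0}^{\infty} \left[ \frac{p(1)}{p(0)} \right]^{k^2} = 1 . \tag{III-6}$$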


The values of $p(1)/p(0)$ are plotted in Figure 7 as functions of $W/W_0$. From 
these values, the $p(k)$ are immediately obtained by means of Eq. (III-4). 

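The maximum value of the average amount of information $H$ can now be 
obtained without difficulty by substituting for the $p(k)$ in Eq. (41) the 
values determined above. We have then, after appropriate manipulation of 
the equation, 

$$\begin{aligned}
H_{\max} &= -p(0) \sum_{k=0}^{\infty} \left[ \frac{p(1)}{p(0)} \right]^{k^2}
            \left\{ \log_2 p(0) + \log_2 \left[ \frac{p(1)}{p(0)} \right]^{k^2} \right\} \\
         &= -\log_2 p(0) - p(0) \sum_{k=1}^{\infty} \left[ \frac{p(1)}{p(0)} \right]^{k^2}
            \log_2 \left[ \frac{p(1)}{p(0)} \right]^{k^2} \\
         &= -\log_2 p(0) - p(0) \sum_{k=1}^{\infty} k^2 \left[ \frac{p(1)}{p(0)} \right]^{k^2}
            \log_2 \frac{p(1)}{p(0)} .
\end{aligned} \tag{III-7}$$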
Using now Eq. (III-5), we obtain finally 

$$H_{\max} = -\left[ \frac{W}{W_0} \log_2 \frac{p(1)}{p(0)} + \log_2 p(0) \right] . \tag{III-8}$$


The value of $H_{\max}$ is plotted in Figure 8 as a function of $W/W_0$, using the 
values of $p(1)/p(0)$ and $p(0)$ given in Figure 7. 
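
The curves described above can be approximated numerically. The sketch below is an illustration under stated assumptions, not the procedure used for Figures 7 and 8: it assumes pulse levels $k = 0, 1, 2, \ldots$, probabilities of the form $p(k) = p(0)\, r^{k^2}$ with $r = p(1)/p(0)$, a normalization condition $\sum_k p(k) = 1$, and a power condition $\sum_k k^2 p(k) = W/W_0$. The function names, the tolerance and the bisection bounds are hypothetical.

```python
from math import log2

def sums(r, tol=1e-18, kmax=100000):
    """Return (sum of r**(k*k) for k >= 0, sum of k*k * r**(k*k) for k >= 1)."""
    s0, s2 = 1.0, 0.0
    for k in range(1, kmax):
        term = r ** (k * k)
        s0 += term
        s2 += k * k * term
        if k * k * term < tol:      # remaining tail is negligible
            break
    return s0, s2

def solve(w, iters=200):
    """Find p(0) and r = p(1)/p(0) for a given power ratio w = W/W_0.

    Bisection on r; the average of k*k grows steadily with r.
    The upper bound 0.9999 limits w to a few thousand.
    """
    lo, hi = 0.0, 0.9999
    for _ in range(iters):
        mid = (lo + hi) / 2
        s0, s2 = sums(mid)
        if s2 / s0 < w:
            lo = mid
        else:
            hi = mid
    r = (lo + hi) / 2
    s0, _ = sums(r)
    return 1.0 / s0, r

def entropy_direct(p0, r, tol=1e-18):
    """Average information per pulse, summed term by term over the levels."""
    h, k = 0.0, 0
    while True:
        pk = p0 * r ** (k * k)
        if pk < tol:
            break
        h -= pk * log2(pk)
        k += 1
    return h

def entropy_closed_form(p0, r, w):
    """Closed form: -[w * log2(r) + log2(p0)]."""
    return -(w * log2(r) + log2(p0))

p0, r = solve(1.0)
print("p(0)         :", p0)
print("p(1)/p(0)    :", r)
print("H direct     :", entropy_direct(p0, r))
print("H closed form:", entropy_closed_form(p0, r, 1.0))
```

The two values of $H$ agree to within rounding error, and repeating the calculation over a range of $W/W_0$ yields curves of $p(1)/p(0)$, $p(0)$ and the maximum average information against $W/W_0$.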